1 Introduction

This stakeholderreport is made in html and the code is hidden for you, just as the report.

In this project we will work with a dataset of 5.000 consumer reviews for a few Amazon electronic products like f. ex. Kindle. Data is collected between September 2017 and October 2018.

For this report we will only use a subset of the data we downloaded from kaggle. We select the following variables:

  • id: An id number given to each review created by us corrensponding to the row number of the raw data.

  • name: The full name of the product

  • reviews.rating: The rating of the product on a scale from 1-5.

  • reviews.title: The title of the review, given by the customer.

  • reviews.text: The review text written by the customer.

As the data is very raw and messy we now do some cleaning. We remove everything that is not normal letters. We also don’t want to analyze the exact word strings in the reviews, as this would include several possible forms of the words used. F. ex. think and thought. Instead we want to merge all possible forms of a word into it’s root word. This we do with lemmatization. We here want to primarily work with tidy text, where there is one token per row. A token here is a single word.

We now have 153.994 tokens, in their each seperate rows in the tokens dataset. By doing lemmatization the number of unique tokens are reduced from around 6000 to around 4600 words, which should prove quite beneficial.

2 Network analysis

In this assignment we want to use network analysis to gain new insights into how the reviews are structured. Here we extract bigrams from each review text, clean and prepare them to then create networks. Where we before considered tokens as individual words, we can create them as n-grams that are a consecutive sequence of words. Bigrams are n-grams with a length of 2 consecutive words. This can be used to gain context and connection between words.

Remember that each bigram overlap, as can be seen from above, so that the first token is “the display” and the second is “display is”. We also remove stopwords such as “i”, “am”, “the” and ect. from the bigrams, as these are not meaninfull for the analysis. These are taken from a dictionary and should remove most stopwords, but are not exhaustive.

The interesting thing is now to visualize the relationship between all words. Before doing this we will need to create the graph from a data frame of the bigrams. Here nodes are the words, and the edges correspond to the connection between the two words in the bigram. The first word in the bigram is the column ‘from’, and the second word is ‘to’, and it’s therefore a directed network. The edges are given a weight corresponding to how many times it occures in the total amount of reviews called ‘n’. The weight is plotted as the alpha value, so more frequent bigrams have a darker colour, and vice versa. Only bigrams that occure more than 15 times are plotted in this network, as it otherwise would get to messy.

bigrams_separated <- bigrams %>% 
  separate(bigram,c("word1","word2"),sep = " ")

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

#New bigram counts
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

set.seed(123)
bigram_graph <- bigram_counts %>% 
  filter(n > 15) %>%  #The occurence of the bigram is more than 15. 
  graph_from_data_frame()

a<- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(bigram_graph, layout = "fr") + 
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.05,'inches'))+
  geom_node_point(color = "pink", size = 3) + 
  geom_node_text(aes(label = name),vjust=1,hjust=1) + 
  theme_void()

The plot above, give us some insights about the connection of words in the reviews. If we where to chose a random word in the graph, the most likely word to come afterwards would be the outgoing connection with the darkest colour. This way we can kinda predict what words that come next. Remember that the words have been lemmatized, so it shows the root word so the sentence created would not be grammatical correct, but would still carry the meaning as a whole.

We see many small connections such as customer -> service, sound -> quality, black -> friday and ect. Then we also have a bigger cluster where love is one of the key words. Many words such as kid, daughter, son, wife ect. point in the direction of love, and then outgoing edges from love is play, watch, alexa. Creating sentences such as “wife love alexa” or “kid love play”. So first we have the person, then the sentiment word love, and then the action they do or what they love. We see that amazon is a central word with many outgoing connections, as many things are called “amazon prime”, “amazon account” ect. Other key nodes are the product names such as “fire”, “kindle”, “hue”.

3 NLP

In this section we will analyze the data using Natural Language Processing. Here we will gain insight in the dataset by extracting and identifing patterns. We will start by doing some data preprocessing before we start our analysis. Then we will do a sentiment analysis, where we analysis the words and whether they’re positive or negative. Next we will do a LSA analysis, where we will apply dimentionality reductions and cluster the data to look for latent patterns for the words and documents. And then we will do a LDA analysis to try to seperate the data into topics to look for patterns for the words.

Before doing the sentiment analysis, we will quickly look a the distribution of the review ratings.

summary(tokens_nlp$reviews.rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.533   5.000   5.000

Here, we can see that there is a overepresentation of positive reviews, where the mean is at 4.533 and the median at 5.00. This will contribute to how we do the rest of the sentiment analysis. There is 1134 one-star review rating, 584 two-star review rating, 3028 three-star review rating, 12070 four-star review rating and 30383 five-star review rating.

3.1 Sentiment analysis

Sentiment analysis refers to a use of text analysis to extract and identify subjective information, where it analyzises whether the words are positive or negative. In this section, we will be doing two sentiment analysis, first by identifying positive and negative words using the bing lexicon and after this using the afinn lexicon.

3.1.1 Bing

The Bing lexicon categorizes words in a binary fashion as positive or negative with no weighting. We are now plotting a word count, grouped by sentiment, showing the 10 most frequent negative and positive words.

sentiment_bing <- tokens_nlp %>% inner_join(get_sentiments("bing"))

sentiment_analysis <- sentiment_bing %>% 
  filter(sentiment %in% c("positive", "negative"))

word_counts <- sentiment_analysis %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n))

ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales ="free") +
coord_flip() +
labs(title ="Sentiment Word Counts",x ="Words")

Here we can see the positive words are much more frequent than the negative words. For positive the words “love” and “easy” is way more frequent than all other words.

And now we will count all positive and negative words for each number of stars.

tokens_nlp %>% 
  inner_join(get_sentiments("bing")) %>%
  count(reviews.rating, sentiment)

From the above table we can see that all categories of reviews both include positive and negative words. Even in 1 star rating reviews, almost a third of all sentiments words are positive. This might indicate that people use negative expressions such as “this is not very good”, and then the only sentiment word is good which classifies as positive, but in reality the sentiment should be negative. This could be fixed with further analysis to remove the effect of negatives, but is not within the current time scope of this project. We also see that 2 star reviews have almost equal negative and positive sentiment words, which results in a almost neutral sentiment.

If we sum the above sentiment words, we can see there is a total of 9.352 sentiment words in our data out of a total of 52.244 words, which means that around 20% of all words left is sentiment words. This means that most words are still some form of stopword, neutral sentiment word or a product word. In the bing lexicon, there’s around 6000 unique sentiment words and in our data there is 345 unique sentiments words from the bing lexicon.

Now we will find the overall sentiment score for every review rating, taking the positive sentiments and subtracting the negative. Then we take the mean and plot it.

tokens_nlp_bing <- tokens_nlp %>%
  inner_join(get_sentiments("bing")) %>%
  count(reviews.rating, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative)

n1 <- tokens_nlp %>% filter(reviews.rating == 1)
s1 <- tokens_nlp_bing$overall_sentiment[1] / count(n1)
n2 <- tokens_nlp %>% filter(reviews.rating == 2)
s2 <- tokens_nlp_bing$overall_sentiment[2] / count(n2)
n3 <- tokens_nlp %>% filter(reviews.rating == 3)
s3 <- tokens_nlp_bing$overall_sentiment[3] / count(n3)
n4 <- tokens_nlp %>% filter(reviews.rating == 4)
s4 <- tokens_nlp_bing$overall_sentiment[4] / count(n4)
n5 <- tokens_nlp %>% filter(reviews.rating == 5)
s5 <- tokens_nlp_bing$overall_sentiment[5] / count(n5)

x <- c(s1,s2,s3,s4,s5)

ggplot(tokens_nlp_bing, aes(x = reviews.rating, y = x, fill = as.factor(reviews.rating))) + geom_col(show.legend = FALSE) + coord_flip() +
labs(title = "Overall Sentiment by Review rating",
     subtitle = "Reviews",
     x = "Review rating",
     y = "Overall Sentiment")

The scores are generally very low, because only 20% of all words in the reviews are sentiment words, and that positive and negative sentiments are averaged out. After taking all of this into account then a 5 star review on average only have around 0,14 positive sentiment, where a value of 1 would be one more positive sentiment word than negative. As we know that we have 9.352 sentiment words in our data, and 5000 reviews, this means that in total we have only around 2 sentiment words pr. review. So it would be hard to predict the amount of stars in the review only based on this bing sentiment, but it could assist in supervised ml.

3.1.2 Afinn

Now, we will analyze the data using the afinn lexicon, which gives every word a score between -5 and 5. Here 5 is very positive, and -5 is very negative. We are again using the function get_sentiment to get a specific sentiment lexicon and inner_join to join the lexcon with tokenized data. After this we can summarize the value of each review rating, take the mean value and plot it.

# get_sentiments("afinn") # used to download afinn package
sentiment_afinn <- tokens_nlp %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(reviews.rating) %>%
  summarize(sentiment = sum(value)) %>%
  arrange(sentiment)

n1 <- tokens_nlp %>% filter(reviews.rating == 1)
s1 <- sentiment_afinn$sentiment[1] / count(n1)
n2 <- tokens_nlp %>% filter(reviews.rating == 2)
s2 <- sentiment_afinn$sentiment[2] / count(n2)
n3 <- tokens_nlp %>% filter(reviews.rating == 3)
s3 <- sentiment_afinn$sentiment[3] / count(n3)
n4 <- tokens_nlp %>% filter(reviews.rating == 4)
s4 <- sentiment_afinn$sentiment[4] / count(n4)
n5 <- tokens_nlp %>% filter(reviews.rating == 5)
s5 <- sentiment_afinn$sentiment[5] / count(n5)

x <- c(s1,s2,s3,s4,s5)

ggplot(tokens_nlp_bing, aes(x = reviews.rating, y = x, fill = as.factor(reviews.rating))) + geom_col(show.legend = FALSE) + coord_flip() +
  labs(title = "Mean of Sentiment Score for All Review Ratings",
       subtitle = "Reviews",
       x = "Review rating",
       y = "Mean Score of Sentiment")

What’s interesting here is that all reviews, except one star rating, have a positive score. Of course this could also be, because it categorizes negative words as positive words or vice versa, just like we discussed in the bing sentiment analysis.

3.2 LSA

Latent Semantic Analysis or simply LSA is a techique to identify and analyze the cooccurrences of words across documents. Coorccurrence suggest that the words are somewhat correlated, either by being synonymous or reflect a shared concept. Examples of shared concepts could be colors or cities. We want to extract meanings between documents and words, assuming that words that are close in meaning will appear in similar pieces of texts.

ggplotly(x)

Here, we can plot the features, the words, here in a two dimensional plot. Here, the function in R has reduced the number of dimensions in the data set using the latent features of the data. It clusters and assigns a probability for each data point, which is a probability of a data point within its cluster, which runs from 0 to 1.

Here, there’s two different clusters and 91 outliers. One of the clusters is quiet big and have 3240 out of the 3542 features. The others are smaller and there’s 91 outliers, which doesn’t have a cluster. There could be lot more clusters, because each cluster should have a minimum of 100 features, but the function only makes two clusters. A lot of the words cluster together, as we can see, where the main group (blue cluster) has almost all of the features assigned to it.

3.2.1 Document analysis

Now, we will move on to analyzing the reviews and how they cluster.

ggplotly(x)

Here the documents cluster differently than the features and they are much more spread out, even when the minimum points for each cluster is the same. There’s also more outliers compared to the last plot. Here there’s six clusters, which basically could be the five different ratings the documents are clustered after, but it’s latent features.

3.3 LDA

Linear Discriminant Analysis (LDA) is a method used to find linear combinations of features that characterizes or seperates two or more classes of objects or events. LDA is closely related to Principal Component Analysis (PCA).

Next we perfome a LDA. Beta in an output of the LDA model. Beta indicates the probability that a word occurs in a certain topic.

lda_beta %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = NULL, y = expression(beta)) +
  facet_wrap(~ topic, ncol = 2, scales = "free")

Above the top 10 terms in each LDA topic are displayed. We choose the number of two clusters since choosing a higher number results in the same words displayed in two or more clusters.“love” is the word with the highest probability of occuring in topic 1, while “tablet” is the word with the highest probability of occuring in topic 2.

It seems like cluster 2 contains some words with a tecnological character (echo, screen, app, alexa, device) while cluster 1 seems related to books/reading (read, book, kindle) and positive words (love, easy).

4 Supervised machine learning

In this section we will analyse whether we can predict if the review is good, here categorized as rating 4 or 5. We will apply supervised machine learning to do so and create a binary logistic model.

We want to dig more deeply into the understading of our model. First we want to investigate which predictors there are driving the model. To do this we will check the coefficients of the models, with the largest value of Lambda.

coefs %>%
  group_by(estimate > 0) %>%
  top_n(15, abs(estimate)) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() + geom_hline(aes(yintercept=0)) +
  labs(
    x = NULL,
    title = "Coefficients that increase/decrease probability the most"
  )

The graph above gives an overview, over which coefficients that either increase or decrease the probalitiy of the models prediction of the rating.

Further more we want to investigate the confusion matrix of the model. Here we want to make a classification, which will tell how many predictions were true.

confusion_matrix
##        
##         FALSE TRUE
##   FALSE     3   77
##   TRUE      1 1157

As the result shows us. The model could predict 1154 true out the original 1250 observations. To check the accuarcy, we choose to use the function yardstick to calculate it:

yardstick::accuracy(confusion_matrix)

As the above result shows, the model could predict 93 percent of the 4 and 5 star rated reviews.

5 Conclusion

In this project we analysied a dataset of 5.000 reviews from Amazone. First, we performed a Network Analyse were we got some insights about the connections of the words in the reviews. Next, we moved on to the NPL were we got insight in the patterns inside the dataset. In the sentiment analysis we were showed which words were positive and negative by the Bing lexicon. By using the Afinn lexicon the positive and negative were weighted.

Lastly we used Maschine Learning to predict wether a review were 4/5 stars. We got an accuracy of 93% which is pretty high. But as we saw the most reviews were given 4 or 5 stars why we could expect a high accuracy.